Array expression profiling: direct inhibition of the NOTCH transcription complex

Goals

We will try to reproduce some of the differential expression results obtained by the paper Direct inhibition of the NOTCH transcription complex. In this paper, Moellering et al. try to design and synthesize a peptide able to inhibit NOTCH transcription factor for the treatment of individuals affected by Acute lymphoblastic leukemia (T-ALL).

NOTCH proteins regulate signaling pathways involved in cellular differentiation, proliferation and apoptosis. Overactive Notch signaling as been observed in numerous cancers and has been extensively studied in the context of T-ALL where more than 50% of pateints have mutant NOTCH1. Small molecule modulators of these proteins would be important for understanding the role of NOTCH proteins in malignant and normal biological processes.

The authors measure the global changes in gene expression upon treatment of the human T-ALL cell lines HPB-ALL and KOPT-K1 with either vehicle alone dimethylsulphoxide (DMSO) control or the designed peptite SAHM1, an alpha-helical hydrocarbon peptide derived from the MAML1 co-activator protein.

Summary design

They triplicate cultures of KOPT-K1 or HPB-ALL cells that were treated with either DMSO or SAHM1 (20 uM) for 24 hours. Total RNA was extracted and hybridized to Affymetrix human U133 plus 2.0 microarrays (three arrays per treatment per cell line for a total of 12 arrays).

Pipeline

Python imports

In [5]:
import rpy2.rinterface
%reload_ext rpy2.ipython

R imports

In [254]:
%%R
##1. Load libraries
library("affy")
library("limma")
library("genefilter")
library(simpleaffy)
library(hgu133plus2.db)
wd <- "/Users/nandoide/misc_work/Desktop/uni/TRREP"
setwd(wd)

Functions

In [281]:
%%R

import_CEL <- function(pattern) {
    # Import CEL files into affiBatch object
    files <- list.files(pattern = pattern)
    names <- gsub(".CEL.gz", "", files)
    abatch <- ReadAffy(filenames = files,  compress = TRUE, sampleNames = names)
    return(abatch)
}

create_eset <- function(affyBatch) {
    # Generates object eset (class ExprSet), 
    # expresso function provides intensities in log scale
    return(expresso(affyBatch,
             bg.correct = TRUE, 
             bgcorrect.method="rma",
             normalize = TRUE, 
             normalize.method="quantiles", 
             pmcorrect.method="pmonly", 
             summary.method="medianpolish",
             verbose = TRUE))
}

boxplots <- function(affyBatch, eset, title) {
    # Generate BOXPLOTS before and after normalization
    boxplot(affyBatch,
        main=paste0("Boxplot Before Normalization ", title),
        col = "lightgrey")
    df_eset <- as.data.frame(exprs(eset))
    
    boxplot(data.frame(df_eset),
        main=paste0("Boxplot After Normalization (log scale) ", title), col = "white")
}

create_TopTable <- function(eset, control_samples=c(1,1,1,0,0,0)) {
    # Generate Toptable with limma 
    
    # Data filtering using IQR.
    esetIQR <- varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=TRUE)

    # Differential expression analysis.#######
    r_control_samples <- 1 - control_samples
    design <- cbind(DMSO=control_samples, SAHM1=r_control_samples)

    rownames(design) <- colnames(eset)

    #7. Contrasts matrix.
    cont.matrix <- makeContrasts(DMSO_SAHM1 = SAHM1 - DMSO, levels = design)

    #8. Obtaining differentially expressed genes (DEGs)
    #Linear model and eBayes 
    fit <- lmFit(esetIQR, design)  ##getting DEGs from IQR 
    fit2 <- contrasts.fit(fit, cont.matrix)
    fit2 <- eBayes(fit2)

    #Table with DEGs results
    toptableIQR <- topTable(fit2, number=dim(exprs(esetIQR))[1], adjust.method="BH", sort.by="p")
    return(toptableIQR)
}

anotate_TopTable <- function(toptable) {
    # Obtain gene names from probe names and chip symbol dataset
    probenames_toptable <- as.character(rownames(toptable ))
    genesymbols_toptable <- as.character(mget(probenames_toptable, hgu133plus2SYMBOL))
    # Annotated gene table
    toptable_anot <- cbind(Symbol = genesymbols_toptable, toptable)
    return(toptable_anot)
}

generank_table <- function(toptable, rnk.file) {
    # Generate rank of table top 50 upregulared and top 50 downregulated from 250 better
    # adjustes p-values
    more_significant = toptable[order(toptable$adj.P.Val, decreasing = FALSE),][1:250,]
    up_50 = more_significant[which(toptable$logFC > 0), ] [1:50,] # up reg top 50
    down_50 = more_significant[which(toptable$logFC < 0), ] [1:50,] # down reg top 50

    print("Down-regulated genes")
    print(down_50[order(down_50$logFC), c(1,2,6)])

    print("Up-regulated genes")
    print(up_50[order(up_50$logFC), c(1,2,6)])

    d <- rbind(down_50[order(down_50$logFC), c(1,2,6)], up_50[order(up_50$logFC), c(1,2,6)])
   
    df <- data.frame(d$Symbol, d$logFC)
    write.table(df,row.names=FALSE,col.names=FALSE,
                quote=FALSE,sep="\t",file=paste0(rnk.file, ".rnk"))
}

Quality control

In [256]:
%%R

setwd("GSE18198_data")
affyBatch = import_CEL("*")
setwd(wd)
affyBatch_MAS5 <- call.exprs(affyBatch,"mas5")
qcs <- qc(affyBatch, affyBatch_MAS5)
plot(qcs)
qcs
An object of class "QCStats"
Slot "scale.factors":
 [1] 0.4624769 0.9923886 0.5749658 0.5299016 0.4725059 0.4445994 1.4066966
 [8] 1.2441075 1.3089160 1.9466663 2.1150710 2.3084671

Slot "target":
[1] 100

Slot "percent.present":
  HPB_DMSO_01.present   HPB_DMSO_02.present   HPB_DMSO_03.present 
             45.75583              41.61500              44.97485 
 HPB_SAHM1_01.present  HPB_SAHM1_02.present  HPB_SAHM1_03.present 
             45.39369              45.37906              46.93004 
 KOPT_DMSO_01.present  KOPT_DMSO_02.present  KOPT_DMSO_03.present 
             39.18793              39.72016              40.17558 
KOPT_SAHM1_01.present KOPT_SAHM1_02.present KOPT_SAHM1_03.present 
             38.46182              38.39049              38.57888 

Slot "average.background":
  HPB_DMSO_01   HPB_DMSO_02   HPB_DMSO_03  HPB_SAHM1_01  HPB_SAHM1_02 
     71.40948      51.82766      67.15371      73.25355      78.06755 
 HPB_SAHM1_03  KOPT_DMSO_01  KOPT_DMSO_02  KOPT_DMSO_03 KOPT_SAHM1_01 
     64.75102      58.67351      58.61665      56.23122      53.46125 
KOPT_SAHM1_02 KOPT_SAHM1_03 
     44.98804      45.99328 

Slot "minimum.background":
  HPB_DMSO_01   HPB_DMSO_02   HPB_DMSO_03  HPB_SAHM1_01  HPB_SAHM1_02 
     67.78476      49.90441      64.95388      68.98329      74.59893 
 HPB_SAHM1_03  KOPT_DMSO_01  KOPT_DMSO_02  KOPT_DMSO_03 KOPT_SAHM1_01 
     61.72838      56.40174      54.42714      53.83356      50.28008 
KOPT_SAHM1_02 KOPT_SAHM1_03 
     42.42914      43.87834 

Slot "maximum.background":
  HPB_DMSO_01   HPB_DMSO_02   HPB_DMSO_03  HPB_SAHM1_01  HPB_SAHM1_02 
     73.33486      52.82101      69.57489      75.70203      79.83810 
 HPB_SAHM1_03  KOPT_DMSO_01  KOPT_DMSO_02  KOPT_DMSO_03 KOPT_SAHM1_01 
     66.69588      60.24642      60.27544      57.78016      54.91569 
KOPT_SAHM1_02 KOPT_SAHM1_03 
     45.78716      46.82307 

Slot "spikes":
              AFFX-r2-Ec-bioB-3_at AFFX-r2-Ec-bioC-3_at AFFX-r2-Ec-bioD-3_at
HPB_DMSO_01               8.248482             9.704242             12.15884
HPB_DMSO_02               9.488241            11.021774             13.39970
HPB_DMSO_03               8.174298             9.589539             12.00550
HPB_SAHM1_01              8.381753             9.772156             12.21394
HPB_SAHM1_02              8.159541             9.712199             12.13458
HPB_SAHM1_03              7.898193             9.390437             11.90994
KOPT_DMSO_01              8.920814             8.512096             12.74309
KOPT_DMSO_02              9.007404             8.715061             12.82763
KOPT_DMSO_03              8.978651             8.593016             12.75616
KOPT_SAHM1_01             9.710773             9.355043             13.32068
KOPT_SAHM1_02            10.167828             9.646911             13.70326
KOPT_SAHM1_03            10.414865             9.875183             13.89010
              AFFX-r2-P1-cre-3_at
HPB_DMSO_01              13.20270
HPB_DMSO_02              14.44458
HPB_DMSO_03              13.18265
HPB_SAHM1_01             13.31337
HPB_SAHM1_02             13.19213
HPB_SAHM1_03             12.97417
KOPT_DMSO_01             14.04051
KOPT_DMSO_02             14.10125
KOPT_DMSO_03             14.00066
KOPT_SAHM1_01            14.63294
KOPT_SAHM1_02            14.82687
KOPT_SAHM1_03            15.01834

Slot "qc.probes":
              AFFX-HSAC07/X00351_3_at AFFX-HSAC07/X00351_5_at
HPB_DMSO_01                  12.73093                11.92493
HPB_DMSO_02                  13.63221                12.27401
HPB_DMSO_03                  12.77779                11.90235
HPB_SAHM1_01                 12.79605                11.63738
HPB_SAHM1_02                 12.66274                11.31439
HPB_SAHM1_03                 12.56844                11.59679
KOPT_DMSO_01                 13.53358                13.02196
KOPT_DMSO_02                 13.49021                13.00410
KOPT_DMSO_03                 13.50476                12.99276
KOPT_SAHM1_01                13.78372                13.04395
KOPT_SAHM1_02                13.85151                12.82967
KOPT_SAHM1_03                13.88275                12.84090
              AFFX-HSAC07/X00351_M_at AFFX-HUMGAPDH/M33197_3_at
HPB_DMSO_01                  12.19455                  12.87855
HPB_DMSO_02                  12.84310                  13.87735
HPB_DMSO_03                  12.24595                  12.89411
HPB_SAHM1_01                 12.17076                  12.98237
HPB_SAHM1_02                 11.91872                  12.87606
HPB_SAHM1_03                 11.97960                  12.69856
KOPT_DMSO_01                 13.21759                  13.80409
KOPT_DMSO_02                 13.20014                  13.72550
KOPT_DMSO_03                 13.16413                  13.78359
KOPT_SAHM1_01                13.33575                  14.13775
KOPT_SAHM1_02                13.31856                  14.20714
KOPT_SAHM1_03                13.29794                  14.34094
              AFFX-HUMGAPDH/M33197_5_at AFFX-HUMGAPDH/M33197_M_at
HPB_DMSO_01                    12.86110                  12.72697
HPB_DMSO_02                    13.80521                  13.68863
HPB_DMSO_03                    12.86872                  12.70080
HPB_SAHM1_01                   12.84679                  12.78306
HPB_SAHM1_02                   12.80222                  12.66854
HPB_SAHM1_03                   12.67308                  12.53523
KOPT_DMSO_01                   13.68232                  13.59723
KOPT_DMSO_02                   13.74016                  13.55294
KOPT_DMSO_03                   13.71389                  13.56799
KOPT_SAHM1_01                  14.11439                  13.93807
KOPT_SAHM1_02                  14.20615                  13.97154
KOPT_SAHM1_03                  14.29181                  14.05688

Slot "bioBCalls":
  HPB_DMSO_01.present   HPB_DMSO_02.present   HPB_DMSO_03.present 
                  "P"                   "P"                   "P" 
 HPB_SAHM1_01.present  HPB_SAHM1_02.present  HPB_SAHM1_03.present 
                  "P"                   "P"                   "P" 
 KOPT_DMSO_01.present  KOPT_DMSO_02.present  KOPT_DMSO_03.present 
                  "P"                   "P"                   "P" 
KOPT_SAHM1_01.present KOPT_SAHM1_02.present KOPT_SAHM1_03.present 
                  "P"                   "P"                   "P" 

Slot "arraytype":
[1] "hgu133plus2cdf"

Load raw data

In [222]:
%%R
setwd("GSE18198_data")
affyBatch_HPB = import_CEL("HPB*")
affyBatch_KOPT = import_CEL("KOPT*")
setwd(wd)

Create expression sets

In [223]:
%%R
eset_HPB <- create_eset(affyBatch_HPB)
eset_KOPT <- create_eset(affyBatch_KOPT)
background correction: rma 
normalization: quantiles 
PM/MM correction : pmonly 
expression values: medianpolish 
background correcting...done.
normalizing...done.
54675 ids to be processed
|                    |
|####################|
background correction: rma 
normalization: quantiles 
PM/MM correction : pmonly 
expression values: medianpolish 
background correcting...done.
normalizing...done.
54675 ids to be processed
|                    |
|####################|
In [224]:
%%R
save(eset_HPB, file="eset_HPB.RData")
save(eset_KOPT, file="eset_KOPT.RData")

Quality plots

In [225]:
%%R
boxplots(affyBatch_HPB, eset, "HPB Cell Line")
boxplots(affyBatch_KOPT, eset, "KOPT Cell Line")

Differential expressed genes

In [233]:
%%R
toptable_HPB <- create_TopTable(eset_HPB)
toptable_anot_HPB <- anotate_TopTable(toptable_HPB)
generank_table(toptable_anot_HPB, "generank_HPB")
[1] "Down-regulated genes"
               Symbol      logFC    adj.P.Val
225342_at         AK4 -2.1078858 1.128982e-05
230710_at    MIR210HG -1.9761492 7.688286e-06
227336_at        DTX1 -1.3841443 8.636324e-05
201842_s_at    EFEMP1 -1.3678548 6.004820e-05
204348_s_at       AK4 -1.3491612 3.350727e-04
227347_x_at      HES4 -1.2640712 1.228528e-04
200953_s_at     CCND2 -1.2357909 7.997079e-05
202464_s_at    PFKFB3 -1.1937490 1.228528e-04
202022_at       ALDOC -1.1753644 1.248092e-04
240546_at   LINC01120 -1.1131135 1.552628e-04
227337_at     ANKRD37 -1.1079168 2.110727e-04
200894_s_at     FKBP4 -1.0737202 5.300330e-04
217078_s_at    CD300A -1.0719092 8.239876e-04
201170_s_at   BHLHE40 -1.0402130 2.549012e-04
202934_at         HK2 -1.0377714 2.104071e-04
201848_s_at     BNIP3 -1.0052154 6.744018e-04
218051_s_at    NT5DC2 -1.0049990 2.851346e-03
203394_s_at      HES1 -0.9993973 7.294799e-04
219371_s_at      KLF2 -0.9800918 3.918155e-04
201251_at         PKM -0.9787897 4.328581e-03
201849_at       BNIP3 -0.9724936 6.167814e-04
213746_s_at      FLNA -0.9705676 1.430899e-03
201516_at         SRM -0.9428084 1.627743e-03
203523_at        LSP1 -0.9360851 2.263449e-03
225944_at         NLN -0.9317256 1.421369e-03
214183_s_at     TKTL1 -0.9290362 2.452591e-03
236180_at          NA -0.9287828 3.049302e-03
201194_at     SELENOW -0.9285463 6.167814e-04
231922_at      ZNF276 -0.9269927 4.098501e-03
209933_s_at    CD300A -0.9049911 8.239876e-04
214752_x_at      FLNA -0.9018607 1.636952e-03
226348_at       FUT11 -0.8976507 1.075839e-03
201212_at        LGMN -0.8972886 1.381530e-03
218305_at        IPO4 -0.8783786 1.923264e-03
205544_s_at       CR2 -0.8502424 1.206402e-03
202145_at        LY6E -0.8406636 2.534145e-03
200859_x_at      FLNA -0.8368746 2.851346e-03
202887_s_at     DDIT4 -0.8271607 1.813467e-03
200965_s_at    ABLIM1 -0.8229490 1.522219e-03
203504_s_at     ABCA1 -0.8211483 1.430899e-03
208116_s_at    MAN1A1 -0.7931451 4.037398e-03
202472_at         MPI -0.7831169 2.929531e-03
207543_s_at     P4HA1 -0.7783781 2.263449e-03
201563_at        SORD -0.7663457 4.281959e-03
222150_s_at      GSAP -0.7532244 4.098501e-03
207539_s_at       IL4 -0.7404704 4.098501e-03
205895_s_at     NOLC1 -0.7357033 4.328581e-03
219389_at       SUSD4 -0.7347418 3.852969e-03
201562_s_at      SORD -0.7343969 4.098501e-03
218984_at        PUS7 -0.7076878 4.281959e-03
[1] "Up-regulated genes"
                  Symbol     logFC    adj.P.Val
204962_s_at           NA 0.8016614 1.927646e-03
222670_s_at         MAFB 0.8314472 1.414249e-03
244075_at             NA 0.8555977 1.927646e-03
205047_s_at         ASNS 0.8566562 1.373571e-03
236153_at             NA 0.8615325 1.373571e-03
228999_at           CHD2 0.8640573 1.731300e-03
202847_at           PCK2 0.8839066 1.373571e-03
242388_x_at        TAGAP 0.8933876 1.116736e-03
243368_at             NA 0.9077303 1.731300e-03
1558212_at            NA 0.9381646 1.381530e-03
212907_at        SLC30A1 0.9572462 5.305252e-04
241505_at             NA 0.9630797 1.430899e-03
230659_at             NA 0.9742286 8.892420e-04
203279_at          EDEM1 0.9785129 3.944123e-04
218923_at           CTBS 0.9839137 1.116736e-03
1558920_at    SLC8A1-AS1 0.9839933 1.272470e-03
215071_s_at    HIST1H2AC 0.9867407 1.634677e-03
217678_at        SLC7A11 1.0002602 6.787587e-04
235795_at           PAX6 1.0066337 6.744018e-04
1556294_at         FXYD2 1.0219109 1.381530e-03
229538_s_at       IQGAP3 1.0315469 5.922793e-04
206864_s_at          HRK 1.0324569 1.969355e-03
243495_s_at       ZNF652 1.0513033 1.522219e-03
218145_at          TRIB3 1.0673820 2.149263e-04
219892_at         TM6SF1 1.0684116 7.475097e-04
244377_at         SLC1A4 1.0880295 2.042060e-04
201010_s_at        TXNIP 1.1062690 1.552628e-04
209921_at        SLC7A11 1.1333321 1.248092e-04
209822_s_at        VLDLR 1.1350680 2.104071e-04
230795_at             NA 1.1699268 2.104071e-04
213931_at             NA 1.1702639 3.804126e-04
201009_s_at        TXNIP 1.2328092 1.663743e-04
244042_x_at           NA 1.2501748 1.080676e-03
218559_s_at         MAFB 1.2553755 5.305252e-04
225957_at         CREBRF 1.2981952 4.382754e-04
222853_at          FLRT3 1.3254635 1.138577e-04
219270_at          CHAC1 1.3362724 3.651273e-05
202672_s_at         ATF3 1.3481402 3.651273e-05
218280_x_at           NA 1.3495011 1.272470e-03
207076_s_at         ASS1 1.4918960 3.651273e-05
201008_s_at        TXNIP 1.4920023 1.663743e-04
243871_at   LOC100130476 1.5055489 4.060560e-04
201464_x_at          JUN 1.5541848 7.688286e-06
236962_at             NA 1.5966520 1.522219e-03
235412_at        ARHGEF7 1.5978442 2.999663e-04
229541_at             NA 1.6143542 2.042060e-04
229147_at         RASSF6 1.6458323 7.688286e-06
235638_at         RASSF6 1.7804932 7.997079e-05
201466_s_at          JUN 2.1333931 2.108830e-06
201465_s_at          JUN 2.2743608 1.053822e-05
In [234]:
%%R
toptable_KOPT <- create_TopTable(eset_KOPT)
toptable_anot_KOPT <- anotate_TopTable(toptable_KOPT)
generank_table(toptable_anot_KOPT, "generank_KOPT")
[1] "Down-regulated genes"
               Symbol     logFC    adj.P.Val
209921_at     SLC7A11 -2.916468 2.066610e-11
205047_s_at      ASNS -2.757747 2.764706e-11
209369_at       ANXA3 -2.661459 2.066610e-11
219270_at       CHAC1 -2.264373 7.569090e-10
226517_at       BCAT1 -2.216157 2.899776e-10
214452_at       BCAT1 -2.166267 3.402665e-09
225285_at       BCAT1 -2.072317 2.630231e-10
217678_at     SLC7A11 -2.058710 7.569090e-10
230748_at     SLC16A6 -2.004791 6.729951e-09
219892_at      TM6SF1 -1.957664 1.012031e-09
220892_s_at     PSAT1 -1.938100 4.136414e-10
204351_at       S100P -1.902944 5.924315e-09
223195_s_at     SESN2 -1.889417 7.569090e-10
214079_at       DHRS2 -1.873374 5.924315e-09
209822_s_at     VLDLR -1.863734 4.214321e-09
212290_at      SLC7A1 -1.860735 2.786337e-08
202847_at        PCK2 -1.828534 1.164949e-09
225520_at          NA -1.790127 1.147212e-09
223062_s_at     PSAT1 -1.786577 1.164949e-09
223196_s_at     SESN2 -1.730910 3.463629e-08
200924_s_at    SLC3A2 -1.698282 2.186020e-08
1553972_a_at      CBS -1.682036 3.595785e-09
207076_s_at      ASS1 -1.675254 2.113541e-08
229787_s_at       OGT -1.662192 1.743772e-07
222632_s_at    LZTFL1 -1.643945 9.106004e-08
212816_s_at       CBS -1.599989 1.068633e-07
224839_s_at      GPT2 -1.553796 1.734549e-08
240983_s_at      CARS -1.552793 5.615021e-08
215411_s_at  TRAF3IP2 -1.540251 2.113541e-08
221539_at    EIF4EBP1 -1.533442 1.149078e-08
201195_s_at    SLC7A5 -1.511384 1.336464e-08
214390_s_at     BCAT1 -1.459915 2.424870e-07
204999_s_at      ATF5 -1.451634 2.113541e-08
210512_s_at     VEGFA -1.448871 5.654616e-08
200878_at       EPAS1 -1.435966 3.072051e-07
212501_at       CEBPB -1.428290 2.844253e-08
224580_at     SLC38A1 -1.428164 1.068633e-07
204744_s_at      IARS -1.419418 2.844253e-08
208693_s_at      GARS -1.406661 4.090508e-08
223059_s_at   FAM107B -1.392809 5.070038e-08
218437_s_at    LZTFL1 -1.332751 1.769993e-07
1558212_at         NA -1.308392 1.982077e-07
205653_at        CTSG -1.279331 2.502696e-07
217078_s_at    CD300A -1.278830 2.424870e-07
203627_at       IGF1R -1.275358 3.751987e-07
221933_at      NLGN4X -1.272235 4.113388e-07
231894_at        SARS -1.263321 2.424870e-07
214095_at       SHMT2 -1.235111 1.982077e-07
201263_at        TARS -1.197960 3.751987e-07
226181_at       TUBE1 -1.189891 3.344154e-07
[1] "Up-regulated genes"
                           Symbol     logFC    adj.P.Val
224429_x_at                    NA 0.9259213 6.376426e-06
220725_x_at                 DNAH3 0.9423323 6.447180e-06
210686_x_at              SLC25A16 0.9631964 4.491576e-06
213605_s_at                    NA 0.9632616 6.833714e-06
1556206_at              LINC00408 0.9653901 4.353339e-06
244114_x_at                    NA 0.9679377 6.750428e-06
241632_x_at                    NA 0.9686426 4.885508e-06
239017_at                      NA 0.9834736 7.316357e-06
1558496_at              LINC02053 0.9853247 5.091816e-06
211585_at                    NPAT 0.9926606 6.429684e-06
236389_x_at                    NA 1.0027588 4.491576e-06
208120_x_at                FKSG49 1.0184438 2.815681e-06
206323_x_at                 OPHN1 1.0212889 5.601191e-06
224284_x_at                FKSG49 1.0251308 3.950099e-06
201464_x_at                   JUN 1.0348832 3.490668e-06
220828_s_at             LINC01949 1.0397500 4.491576e-06
81737_at             LOC100505915 1.0462560 5.662707e-06
210800_at                  TIMM8A 1.0479516 4.730744e-06
215182_x_at                    NA 1.0559763 4.491576e-06
243489_at                      NA 1.0626878 4.070717e-06
240988_x_at                    NA 1.0629756 4.779299e-06
224288_x_at                FKSG49 1.0813463 1.113017e-06
AFFX-r2-Ec-bioB-5_at           NA 1.1020958 5.322529e-06
242862_x_at                    NA 1.1113007 1.113017e-06
1563674_at                  FCRL2 1.1127799 6.945728e-06
AFFX-BioC-5_at                 NA 1.1199525 6.430788e-06
210718_s_at                    NA 1.1396478 6.670158e-06
1562755_at                     NA 1.1425399 2.235176e-06
232964_at                      NA 1.1460961 1.656041e-06
220232_at                    SCD5 1.1545960 1.183928e-06
1569940_at                SLC6A16 1.1781696 8.464513e-07
211454_x_at                FKSG49 1.1800267 5.834686e-07
AFFX-BioB-M_at                 NA 1.1849198 6.376426e-06
209700_x_at               PDE4DIP 1.1871538 8.406263e-07
1566145_s_at                   NA 1.1918754 3.003518e-07
234949_at                  FRG1BP 1.1922760 1.231065e-06
1560144_at                     NA 1.2077566 1.790306e-06
1553185_at                  RASEF 1.2129105 6.160739e-07
231597_x_at                    NA 1.2169048 1.477715e-06
242619_x_at                    NA 1.2257718 4.302148e-07
1553186_x_at                RASEF 1.2330332 6.881976e-07
227952_at                  ZNF595 1.2408024 5.034675e-07
1561754_at                     NA 1.2457988 9.800827e-07
224159_x_at                 TRIM4 1.2471076 7.904835e-07
243689_s_at                FRG1BP 1.3748167 6.160739e-07
231598_x_at                    NA 1.4418759 1.469974e-07
228919_at                      NA 1.4589605 1.682330e-06
1562527_at              LOC441666 1.4936054 7.062932e-08
1558048_x_at                   NA 1.5005510 2.186020e-08
211565_at                  SH3GL3 1.5671795 2.844253e-08

Generate GSEA gct, cls files

To send raw data, we need to process the expression data from affiBatch:

exprs2_HPB <- cbind(featureNames(affyBatch_HPB), c(" "), exprs(affyBatch_HPB))
write.table(exprs2_HPB,row.names=FALSE,col.names=FALSE,quote=FALSE,file="eset_HPB.tsv", sep = "\t")
exprs2_KOPT <- cbind(featureNames(affyBatch_KOPT), c(" "), exprs(affyBatch_KOPT))
write.table(exprs2_KOPT,row.names=FALSE,col.names=FALSE,quote=FALSE,file="eset_KOPT.tsv", sep = "\t")

We decide to compute from expression set generated by expresso, in order to consume less processing time in GSEA. But we need to transform the expression amount to no log quantities.

In [269]:
%%R
exprs2_HPB <- cbind(c(" "), 2 ** exprs(eset_HPB))
write.table(exprs2_HPB,row.names=TRUE,col.names=FALSE,quote=FALSE,file="eset_HPB.tsv", sep = "\t")
exprs2_KOPT <- cbind(c(" "), 2 ** exprs(eset_KOPT))
write.table(exprs2_KOPT,row.names=TRUE,col.names=FALSE,quote=FALSE,file="eset_KOPT.tsv", sep = "\t")
In [270]:
%%bash
echo "#1.2" > gct.head.HPB
echo "$(cat eset_HPB.tsv | wc -l) 6" >> gct.head.HPB
echo "GID	NAME	HPB_DMSO_01	HPB_DMSO_02	HPB_DMSO_03	HPB_SAHM1_01	HPB_SAHM1_02	HPB_SAHM1_03" >> gct.head.HPB

echo "#1.2" > gct.head.KOPT
echo "$(cat eset_KOPT.tsv | wc -l) 6" >> gct.head.KOPT
echo "GID	NAME	KOPT_DMSO_01	KOPT_DMSO_02	KOPT_DMSO_03	KOPT_SAHM1_01	KOPT_SAHM1_02	KOPT_SAHM1_03" >> gct.head.KOPT

cat gct.head.HPB eset_HPB.tsv > eset_HPB.gct
cat gct.head.KOPT eset_KOPT.tsv > eset_KOPT.gct

echo "6	2	1" > phenotypes.cls
echo "#DMSO SAHM1" >> phenotypes.cls
echo "0	0	0	1	1	1" >> phenotypes.cls

Processing all samples

In [214]:
%%R
setwd("GSE18198_data")
affyBatch = import_CEL("*")
setwd(wd)
eset <- create_eset(affyBatch)
toptable <- create_TopTable(eset, control_samples=c(1,1,1,0,0,0,1,1,1,0,0,0))
toptable_anot <- anotate_TopTable(toptable)
generank_table(toptable_anot, "generank")
save(eset, file="eset.RData")
background correction: rma 
normalization: quantiles 
PM/MM correction : pmonly 
expression values: medianpolish 
background correcting...done.
normalizing...done.
54675 ids to be processed
|                    |
|####################|
[1] "Down-regulated genes"
               Symbol      logFC      P.Value
227347_x_at      HES4 -1.2391846 4.541483e-05
227336_at        DTX1 -1.0787341 1.772044e-03
230263_s_at     DOCK5 -1.0589452 2.174812e-04
218051_s_at    NT5DC2 -0.9507155 1.126824e-03
205544_s_at       CR2 -0.9438572 5.961088e-05
202464_s_at    PFKFB3 -0.9408542 1.077045e-04
226452_at        PDK1 -0.8325181 2.019305e-05
223364_s_at     DHX37 -0.8207408 1.605856e-03
206686_at        PDK1 -0.8172945 9.936732e-04
203627_at       IGF1R -0.8050123 1.277490e-03
203867_s_at      NLE1 -0.7963720 7.746991e-04
204513_s_at     ELMO1 -0.7885410 1.839908e-04
207543_s_at     P4HA1 -0.7663495 3.287902e-05
212063_at        CD44 -0.7621344 1.769410e-03
227337_at     ANKRD37 -0.7620441 3.022164e-04
239410_at         HK2 -0.7615742 2.557385e-03
1554918_a_at    ABCC4 -0.7263246 1.912084e-03
231094_s_at        NA -0.7232606 9.131629e-04
200965_s_at    ABLIM1 -0.7085777 6.009393e-04
210625_s_at     AKAP1 -0.6987526 1.183409e-04
231310_at      TRIM71 -0.6679915 6.024183e-04
219253_at    TMEM185B -0.6494920 1.313180e-03
215195_at       PRKCA -0.6258312 2.255091e-04
228205_at         TKT -0.6126859 8.915350e-04
218806_s_at      VAV3 -0.5936019 2.050231e-03
206923_at       PRKCA -0.5924819 2.694018e-04
201367_s_at   ZFP36L2 -0.5879670 1.331411e-03
236180_at          NA -0.5535885 1.051495e-03
221989_at          NA -0.5531493 1.744579e-04
223058_at     FAM107B -0.5499447 2.198129e-03
208858_s_at     ESYT1 -0.5411513 2.844208e-04
1555434_a_at SLC39A14 -0.5382917 2.440482e-03
1553138_a_at   ANKLE1 -0.5306391 5.104033e-04
227099_s_at  C11orf96 -0.5289498 2.627354e-03
203612_at        BYSL -0.5257364 1.818973e-04
226938_at       DCAF4 -0.5161954 1.874088e-03
214484_s_at   SIGMAR1 -0.5153579 1.465927e-03
206653_at      POLR3G -0.5067532 2.177288e-03
226498_at        FLT1 -0.5041981 2.319967e-03
208758_at        ATIC -0.4943349 1.053895e-03
208997_s_at      UCP2 -0.4905291 2.479831e-03
209461_x_at     WDR18 -0.4862286 8.241993e-04
201161_s_at      YBX3 -0.4839871 9.730003e-04
225883_at     ATG16L2 -0.4745402 2.113101e-03
201692_at     SIGMAR1 -0.4727785 1.380736e-03
217139_at       VDAC1 -0.4553310 2.588750e-03
229236_s_at     SFXN4 -0.4251656 1.858039e-03
224824_at       COX20 -0.4173264 2.375617e-03
204027_s_at    METTL1 -0.4116541 2.046282e-03
201250_s_at    SLC2A1 -0.3947427 2.225655e-03
[1] "Up-regulated genes"
                   Symbol     logFC      P.Value
232059_at         DSCAML1 0.3346000 0.0068497056
209392_at           ENPP2 0.3540105 0.0052561029
205381_at          LRRC17 0.3645748 0.0064690771
214710_s_at         CCNB1 0.3671989 0.0073863284
1569680_at             NA 0.3814353 0.0033810217
1559023_a_at      EFCAB14 0.3845018 0.0052968042
236353_at              NA 0.3952903 0.0082999985
206448_at          ZNF365 0.4055256 0.0032538740
226936_at           CENPW 0.4121128 0.0055882829
243469_at              NA 0.4133233 0.0048561664
201896_s_at         PSRC1 0.4219794 0.0084642203
1568596_a_at        TROAP 0.4261581 0.0017179111
238875_at              NA 0.4288226 0.0078367183
242966_x_at          RFX2 0.4321379 0.0027616154
236253_at          ZNF546 0.4657924 0.0075436998
1557290_at             NA 0.4742477 0.0041564711
243992_at              NA 0.4761079 0.0040639088
204641_at            NEK2 0.4770373 0.0034259514
220167_s_at            NA 0.4845848 0.0045882849
241685_x_at          PURA 0.4879466 0.0058069390
239735_at              NA 0.4897705 0.0030891034
202644_s_at       TNFAIP3 0.4934166 0.0004891236
242476_at              NA 0.5030038 0.0056605223
238595_at              NA 0.5058669 0.0059647137
213605_s_at            NA 0.5068902 0.0050211914
244427_at           KIF23 0.5209074 0.0015710943
242637_at              NA 0.5222779 0.0047282688
232953_at     LINC00266-1 0.5259613 0.0041673250
1559156_at             NA 0.5325258 0.0057922093
228390_at           RAB30 0.5332846 0.0071758418
238407_at              NA 0.5390014 0.0066182068
216756_at              NA 0.5406353 0.0019487381
239248_at      SDCBP2-AS1 0.5441412 0.0010999281
213544_at            ING2 0.5513826 0.0046805162
239531_at              NA 0.5536821 0.0026133238
243709_at         SLC38A9 0.5569113 0.0009694726
241745_at    LOC100507557 0.5600510 0.0027569686
1557813_at             NA 0.5691647 0.0029037060
215599_at              NA 0.5931984 0.0012579337
244114_x_at            NA 0.6411913 0.0034400322
213281_at             JUN 0.6419108 0.0061333797
228834_at            TOB1 0.6427144 0.0006729806
244532_x_at            NA 0.6470301 0.0046383310
230795_at              NA 0.6509000 0.0039752814
216094_at              NA 0.6647132 0.0040965858
234759_at    LOC100287497 0.6758224 0.0008985553
244075_at              NA 0.6914067 0.0071383024
215071_s_at     HIST1H2AC 0.7357238 0.0038095135
210718_s_at            NA 0.7879121 0.0013751119
201465_s_at           JUN 1.5137055 0.0061879051
In [271]:
%%R
# Create files for GSEA
exprs2 <- cbind(c(" "), 2 ** exprs(eset))
write.table(exprs2, row.names=TRUE,col.names=FALSE, quote=FALSE,file="eset.tsv", sep = "\t")
In [272]:
%%bash
echo "#1.2" > gct.head
echo "$(cat eset.tsv | wc -l) 12" >> gct.head
echo "GID	NAME	HPB_DMSO_01	HPB_DMSO_02	HPB_DMSO_03	HPB_SAHM1_01	HPB_SAHM1_02	HPB_SAHM1_03	KOPT_DMSO_01	KOPT_DMSO_02	KOPT_DMSO_03	KOPT_SAHM1_01	KOPT_SAHM1_02	KOPT_SAHM1_03" >> gct.head
#head gct.head
cat gct.head eset.tsv > eset.gct

echo "12	2	1" > phenotypes_all.cls
echo "#DMSO SAHM1" >> phenotypes_all.cls
echo "0	0	0	1	1	1	0	0	0	1	1	1" >> phenotypes_all.cls

GSEA results

See figures 1 to 8.

Enrichment plot Notch Signalling Pathway

Gene sets enriched in phenotype DMSO(Cell Line HPB-ALL)

Heatmap(Cell Line HPB-ALL)

Gene sets enriched in phenotype DMSO(Cell Line KOPT-K1)

Heatmap(Cell Line KOPT-K1)

Gene sets enriched in phenotype DMSO(Cell Lines: HPB-ALL KOPT-K1)

Heatmap(Cell Lines: HPB-ALL KOPT-K1)

Heatmap(NOTCH signaling pathway)

Methods

Pipeline

We have created the working environment under an i-python notebook of the jupyter platform configured to be able to execute R in code cells that start with %%R. To do so we have used the python package rpy2. This allows us to keep the documentation unified with the execution pipeline. It also becomes a good environment to launch hybrid pipelines with steps in R, python or even bash. It would not be difficult to develop on top a checkpoint and restart system for those developments highly time consuming.

The interface with python has not been necessary but we have developed in bash part of the conversions to the gct format and the generation of phenotype files (.cls). We use the expression set tranformed by expresso method from affy package as input on GSEA.

In R we used the affy, limma, and simpleaffy libraries and developed a pipeline similar to the one followed in the lectures. We group in functions the most used methods to be able to launch in a more compact way the different experiments.

Input data

We download from the GEO platform the raw data belonging to expression arrays Human Genome U133 Plus 2.0 with code GSE18198. These data have RNA information from cell cultures of the KOPT-K1 and HPB-ALL lines treated for 24h with SAHM1. Also their respective controls with the same amount of DMSO.

Analysis strategy

The quality analysis is carried out with the MAS 5.0 algorithm provided by simpleaffy.

The two cell lines were analyzed separately and jointly in R-limma and in GSEA.

In R-limma we use the Benjamini, Hochberg algorithm to control the false discovery rate derived from multiple testing and obtain adjusted p-values.

Before to fit the model in limma, we call the expresso command to got background correction, normalization and summarization to $log_2$ values.

The conversion to GSEA is done from the expression set obtained from the expresso command. Then we perform the conversions to the gct format.

The results in GSEA and R are coherent, at least in regard to our target: NOTCH signaling pathway on which we have concentrated exclusively.

Discusion

Quality

We use the simpleaffy R package that generates a series of metrics recommended by the manufacturer Affymetrix:

  1. Average background
  2. Scale factor
  3. Percentage of genes called present.
  4. 3' to 5' ratios (related with RNA degradation)

It is observed that all the indicators are within the acceptance margins (see graph of section 1.5), but that the patterns are clearly different between the samples of both cell lines.

Differential expression

We performed three different analysis to discover the effectiveness of SAHM1 in the inhibition of NOTCH.

  1. Comparison between control (DMSO) and inhibitor (SAHM1) in the HBP-ALL cell line.
  2. Comparison between control (DMSO) and inhibitor (SAHM1) in the KOPT-K1 cell line.
  3. Comparison between control (DMSO) and inhibitor (SAHM1) joint for both cell lines.

It seems more correct to isolate each cell line separately in the analysis, according to the results of the quality analysis, where the expression patterns within cell lines appear more homogeneous than between. However, in figure 3 of the paper a heatmap is shown where the 12 samples seem to have been treated together, so we reproduced this analysis in case it could really show significant differences with the individual ones.

On the HPB-ALL cell line (1.9) and on the analysis of the two lines together (1.11) we found several direct targets of Notch TF among the 50 most significantly infra-regulated probes on inhibition scenario (lowest values of adjusted p-value and logFC <0): HES1, HES4, and DTX1, which are also investigated in the article. This result is also reproduced in the parallel analysis performed on GSEA. See figures 3 & 7.

GSEA also provides hallmarks that are overexpressed in the absence of inhibitory treatment, and among them we find Notch signaling (figures 2, 4 ,6)

The ES plots of the notch signalling pathway show that their gen-set is overrepresented in the high zone of the ranking in both cell lines. Figure 1 shows the one calculated for HPB-ALL.

These results are compatible with the expected effectiveness of SAHM1 as a NOTCH inhibitor.

In the analysis on the KOPT-K1 line, the hallmark NOTCH signaling still appears overexpressed in absence of inhibitor, but there are only traces of HES4 in the GSEA heatmap.

To get more insights we need to dive into notch hallmark in GSEA analysis.

So, for GSEA analysis performed in both cell-lines, we see (figure 8) that HES1, DTX1 and NOTCH homologous are under-expressed in treated samples, confirming the paper results.

Other results

We have not included them in this document, but we have also worked against GSEA from the raw experimental data, starting directly from the affiBatch object. The conversion to gct is somewhat different and is detailed in [1.7]. We have not time to analyze this results.

Another hallmark that I analyse are the MYC_targets (figures 9 and 10). MYC is mentioned at the paper and we have found articles that relate this with NOTCH signaling ("NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth", Teresa Palomero et al.).

Also it is overrepresented in the upper area of the ranking related to control cells:

Enrichment plot MYC Target Pathway (HPB-ALL)

Enrichment plot MYC Target Pathway (KNOPT-K1)

Conclusions

These results seem to confirm that the designed peptide is carrying out the desired functions related to the inhibition of the NOTCH signaling pathway, although we have not been able to delimit the targets in the same way as in the article, especially in the analysis of the cell line KOPT-K1.

It has become clear to me, as on other occasions, that differences between computational procedures can give rise to subtle and sometimes not so subtle differences between the data obtained. It is more than necessary to execute analysis with several tools, at least three. In this case we have done it with two: GSEA and R-limma, but in GSEA we do not input raw data, so being strict, we would not be following our own recommendations. Still, there are differences.

Another practice that we believe is advisable is to have a highly tested homemade version of the main algorithms. This can help to analyze the reliability of the software used. It's not necessary that this type of code be highly optimized.

Outputs

In [284]:
%%bash
jupyter nbconvert --to=latex --template=~/report.tplx TRREP_final.ipynb 1> /dev/null 2> /dev/null
/Library/TeX/texbin/pdflatex -shell-escape TRREP_final 1> /dev/null 2> /dev/null
jupyter nbconvert --to html_embed --template toc2 TRREP_final.ipynb 1> /dev/null 2> /dev/null